Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

130 ◾ Bioinformatics

1. Acquiring the raw data in FASTQ files.

2. Downloading the sequence of the reference genome, indexing (with samtools and

bwa) the reference sequence, and creating a dictionary (with Picard) for the reference

sequence.

3. Mapping the raw reads in FASTQ files to a reference genome using any of the aligner

(e.g., BWA) to generate aligned reads in SAM files.

4. Converting SAM files to BAM files using samtools.

5. Sorting and indexing the alignment in the BAM files using samtools.

6. Extracting the chromosome or the interval of interest, otherwise, skipping this step

if you are interested in the variants of the whole genome.

7. Removing or marking duplicate alignments in the BAM files using Picard and index-

ing the resulted BAM with samtools.

8. Adding RG header to the BAM files using Picard and indexing the resulted BAM files

with samtools.

9. Applying Base Quality Score Recalibration (BQSR) for detecting systematic errors

made by the sequencer.

10. Identification of the genotype at all positions per sample. This step involves

“HaplotypeCaller”, which takes a BAM file as input and outputs a GVCF file format

for each sample.

11. The next step in the GATK workflow is to consolidate the GVCF files of the multiple

samples and the variants are called across all samples in the cohort forming a single

VCF file. This step involves GenomicsDBImport and GenotypeGVCFs.

12. Separation of SNPs and InDels in separate VCF files using GATK SelectVariants.

4.2.2.2.1 Acquiring raw data

The raw sequence data usually is in FASTQ format. This time, we will use human raw data

from 1000 Genome project to demonstrate variant calling using GATK4 best practice. For

simplicity, instead of the whole genome, we will extract only the reads of chromosome

21 of 13 individuals including males and females from Africa, Asia, and Europe. Table

4.2 shows the NCBI SRA data from which the FASTQ files for chromosome 21 will be

extracted and sample information.

To keep the files organized, we can create a directory for this project “vcfgatk”, and

inside the directory, we will create a text file “ids.txt” that contains only the above ID col-

umn without a column header and each ID in a line. Then, while you are inside the project

directory, you can save the following bash script in a file “download.sh” and execute it as

“bash download.sh”: